Payment Delays model

Imports

Constants

Reading the data

Check the data

Look at the first 10 data features

Convert "yes" and "no" to integers 1 and 0
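A minimal sketch of this conversion with pandas follows. The toy values are invented, and treating international_plan, voice_mail_plan and payment_delay as the "yes"/"no" columns is an assumption made for illustration.

```python
import pandas as pd

# Toy frame: column names follow the ones used in this notebook,
# the values are invented for illustration.
df = pd.DataFrame({
    "international_plan": ["no", "yes", "no"],
    "voice_mail_plan": ["yes", "no", "no"],
    "payment_delay": ["no", "no", "yes"],
})

# Assumed list of yes/no columns; adjust to the real dataset.
yes_no_columns = ["international_plan", "voice_mail_plan", "payment_delay"]
for column in yes_no_columns:
    # Values outside the mapping become NaN, so unexpected labels are easy to spot.
    df[column] = df[column].map({"yes": 1, "no": 0})
```

Using an explicit mapping rather than a generic label encoder keeps the meaning of 1 and 0 fixed across columns.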

Let's have a more detailed look

We have 3000 distinct clients.

13.7% of the records in the dataset are delayed payments.

A minimum value of 1 for account_length is unusual.

Check for missing data

There is no missing data in the entire dataset.
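A check of this kind might look like the sketch below. The toy frame deliberately contains one missing value to show what the check reports; on the real dataset every count would be zero.

```python
import numpy as np
import pandas as pd

# Toy frame with one deliberately missing value (invented data).
df = pd.DataFrame({
    "account_length": [101, 1, 87],
    "total_day_charge": [45.1, np.nan, 30.2],
})

# Number of missing entries per column, then across the whole frame.
missing_per_column = df.isna().sum()
total_missing = int(missing_per_column.sum())
```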

Data exploration

Correlation

Check for data correlation

We notice:

  1. Positive correlation between total_day_charge and total_day_minutes
  2. Positive correlation between total_eve_charge and total_eve_minutes
  3. Positive correlation between total_night_charge and total_night_minutes
  4. Positive correlation between total_intl_charge and total_intl_minutes

As a result, we will drop total_day_minutes, total_eve_minutes, total_night_minutes and total_intl_minutes to reduce the dimensionality of the model's input.
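The correlation check and the column removal can be sketched as follows. The numbers are invented, and the charge column is constructed as an exact multiple of the minutes column only to make the correlation obvious.

```python
import pandas as pd

# Invented data: total_day_charge is an exact multiple of total_day_minutes here.
df = pd.DataFrame({
    "total_day_minutes": [100.0, 200.0, 150.0, 300.0],
    "total_day_charge": [17.0, 34.0, 25.5, 51.0],
    "number_customer_service_calls": [1, 0, 3, 2],
})

# Pairwise Pearson correlation between all numeric columns.
corr = df.corr()

# Drop the redundant member of each highly correlated pair.
redundant = ["total_day_minutes"]
reduced = df.drop(columns=redundant)
```

Keeping the charge columns rather than the minutes columns is the choice made in this notebook; either member of a pair carries nearly the same information.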

Our data set now looks like this:

Account_length analysis

I had some suspicions regarding account_length, but a histogram of its values shows that it is fairly well distributed across the dataset, so I will not remove any entries based on this criterion.

Also, the length does not seem to influence the outcome in any way, nor is it correlated with th

We will assume that payment delays are not influenced by state, area code or account length.

payment_delay vs. international_plan

We can see that payment delays are more frequent when the client also has an international plan.

payment_delay vs. voice_mail_plan

We can see that payment delays are more frequent when the client does not have a voice mail plan.
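Comparisons like these can be backed by the delay rate per group. A small sketch, assuming the plan and delay columns have already been converted to 0/1 (the values are invented):

```python
import pandas as pd

# Invented 0/1 data; 1 means "yes" for the plans and "delayed" for payment_delay.
df = pd.DataFrame({
    "international_plan": [1, 1, 0, 0, 0, 0],
    "voice_mail_plan": [0, 1, 0, 0, 1, 1],
    "payment_delay": [1, 0, 1, 0, 0, 0],
})

# The mean of a 0/1 column is the share of delayed payments in each group.
delay_rate_by_intl = df.groupby("international_plan")["payment_delay"].mean()
delay_rate_by_vmail = df.groupby("voice_mail_plan")["payment_delay"].mean()
```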

Charge vs payment delays

It looks like payment delays are more frequent for clients with higher call charges, regardless of the time of day.

This difference is most noticeable in the case of total_day_charge.

Predictive models

Predictors and desired values

Split data into train and validation sets

We also need to make copies of those splits for later use.
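One way to do the split and keep untouched copies is sketched below with scikit-learn. The 80/20 ratio, the stratification and the random seed are assumptions, and the data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic predictors and an imbalanced target (14 delayed out of 100).
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "total_day_charge": rng.uniform(0, 60, size=100),
    "number_customer_service_calls": rng.integers(0, 6, size=100),
})
y = pd.Series([1] * 14 + [0] * 86, name="payment_delay")

# stratify=y keeps the share of delayed payments similar in both splits.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Independent copies, so later preprocessing does not alter the originals.
X_train_copy, X_val_copy = X_train.copy(), X_val.copy()
```

Stratifying matters here because the delayed class is small; a plain random split could leave very few delayed payments in the validation set.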

Random Forest Classifier

Feature importance

It looks like total_day_charge is rated as the most important feature by Random Forest, followed by total_eve_charge and number_customer_service_calls.
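A ranking like this can be read from the fitted model's feature_importances_ attribute. The sketch below uses synthetic data in which the target depends only on total_day_charge, so the resulting order is illustrative and not the notebook's result.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: the label is driven by total_day_charge alone.
rng = np.random.default_rng(1)
X = pd.DataFrame({
    "total_day_charge": rng.normal(size=300),
    "total_eve_charge": rng.normal(size=300),
    "number_customer_service_calls": rng.integers(0, 6, size=300),
})
y = (X["total_day_charge"] > 0.8).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Impurity-based importances, labelled by column and sorted from high to low.
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
```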

Confusion matrix

We also calculate the area under the ROC (receiver operating characteristic) curve.

The ROC-AUC score obtained with RandomForestClassifier is 0.835.
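For reference, a confusion matrix and a ROC-AUC score can be computed as in the sketch below. The data is synthetic, so the score will not match the figure above. Scoring on predicted probabilities rather than hard class labels is an assumption; the two generally give different values.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary problem with a minority positive class.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=400) > 1.0).astype(int)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Probability of the positive class (column 1) for each validation row.
val_scores = model.predict_proba(X_val)[:, 1]
auc = roc_auc_score(y_val, val_scores)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_val, model.predict(X_val))
```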

Adaptive Boosting Classifier

We fit the model

Predicted values and feature importance

The ROC-AUC score obtained with AdaBoostClassifier is 0.641.

XGBoost

Preparing the model

Training phase

Variable Importance

Out of all the decision-tree-based classifiers, it looks like the gradient boosting method (XGBoost) has the best accuracy. It also outputs a probability for each client, which will come in handy when ranking the clients most likely to delay their payments.

Neural network approach

Preparing the data

For a neural network, we need to balance the two classes.
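The balancing method is not spelled out here. One simple option, shown purely as an assumption, is to randomly undersample the majority class in the training set until both classes have the same size (the data below is invented):

```python
import numpy as np
import pandas as pd

# Invented, imbalanced training data: 3 delayed payments out of 20.
train = pd.DataFrame({
    "total_day_charge": np.linspace(10, 60, 20),
    "payment_delay": [1] * 3 + [0] * 17,
})

minority = train[train["payment_delay"] == 1]
majority = train[train["payment_delay"] == 0]

# Keep as many majority rows as there are minority rows.
majority_down = majority.sample(n=len(minority), random_state=0)

# Concatenate and shuffle so the classes are interleaved.
balanced = (
    pd.concat([minority, majority_down])
    .sample(frac=1, random_state=0)
    .reset_index(drop=True)
)
```

Undersampling discards data, which can hurt on a small dataset; oversampling the minority class or using class weights are common alternatives. Whichever is used, only the training split should be rebalanced.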

Defining the model

Given the current problem, it seems that a neural network is not the way to go.

We will use the XGBoost classifier to rank the top 300 clients.

Ranking clients who are prone to delay their payments
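The ranking step itself is independent of the classifier: given one predicted delay probability per client, sort in descending order and keep the top N. A minimal sketch with invented client IDs and probabilities (the names client_id and delay_probability are hypothetical, and N is 3 here instead of 300):

```python
import numpy as np
import pandas as pd

# Invented inputs; in practice the probabilities would come from the trained model.
client_ids = np.array([101, 102, 103, 104, 105])
delay_probability = np.array([0.10, 0.85, 0.40, 0.92, 0.05])
top_n = 3

ranking = (
    pd.DataFrame({"client_id": client_ids, "delay_probability": delay_probability})
    .sort_values("delay_probability", ascending=False)  # most likely to delay first
    .head(top_n)
    .reset_index(drop=True)
)
```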